A novel computational strategy for defining the minimal protein molecular surface representation

Most proteins perform their biological function by interacting with one or more molecular partners. In this respect, characterizing local features of the molecular surface, that can potentially be involved in the interaction with other molecules, represents a step forward in the investigation of the mechanisms of recognition and binding between molecules. Predictive methods often rely on extensive samplings of molecular patches with the aim to identify hot spots on the surface. In this framework, analysis of large proteins and/or many molecular dynamics frames is often unfeasible due to the high computational cost. Thus, finding optimal ways to reduce the number of points to be sampled maintaining the biological information (including the surface shape) carried by the molecular surface is pivotal. In this perspective, we here present a new theoretical and computational algorithm with the aim of defining a set of molecular surfaces composed of points not uniformly distributed in space, in such a way as to maximize the information of the overall shape of the molecule by minimizing the number of total points. We test our procedure’s ability in recognizing hot-spots by describing the local shape properties of portions of molecular surfaces through a recently developed method based on the formalism of 2D Zernike polynomials. The results of this work show the ability of the proposed algorithm to preserve the key information of the molecular surface using a reduced number of points compared to the complete surface, where all points of the surface are used for the description. In fact, the methodology shows a significant gain of the information stored in the sampling procedure compared to uniform random sampling.


Introduction
Interactions between biomolecules play a fundamental role for most cellular processes, from DNA replication to protein degradation [1][2][3]. Since it has been estimated that over 80% of proteins operate in molecular complexes [4], many computational approaches have been developed for investigating and/or predicting protein-protein interactions (PPIs) in physiology and pathology [5][6][7][8][9][10]. In this scenario, the shape complementarity between interacting regions plays a fundamental role because at short distances, it dictates the stabilizing role exerted by van der Waals interactions: thus, the shape of local surface regions has a key role in predicting the ability of a protein to bind its molecular partner [11]. From this point of view, the definition of the molecular surface plays an important role [12][13][14][15][16][17]. Many algorithms have been designed in order to build the molecular surface at different resolution scales partner [18], typically considering a probe radius up to 5 Å, which is approximately the size of a large solvent molecule. The variation in the size of the probe molecule affects the level of detail of the molecular surface [12] as much as the density of points used in its representation. Computational methods that aim at the identification of hot-spots and/or binding regions often perform extensive samplings of molecular surface portions, or patches [8,19,20]. Intuitively, the more the surface is sampled the more the reaching for hot-spot is accurate. Similarly, the higher the number of different points used to represent the surface the higher the level of detail of the molecular shape.
However, time and computational costs limit both the resolution of the surface and the number of patches that can be sampled, especially for large protein complexes and/or in analyses that involve a big set of surfaces, like for example molecular dynamics data. Although several methods have been developed over the past decades to optimize the computational calculation of any molecular surface, the search for optimal techniques still remains a challenge today [21][22][23].
Here, we present a new method for reducing the molecular surface, maximizing the overall information of the protein shape, and minimizing, in principle, the number of total surface points. The basic idea of the proposed new algorithm is the selection of molecular surface points according to the local roughness (that is the degree of complexity of shape of each surface region): increasing the sampling in high roughness regions and decreasing the sampling in the more flat ones. In particular, we define a sampling probability that depends on the local roughness of the surface. We discuss the performance of the algorithm as a function of several parameters, evaluating the ability of the algorithm to describe locally portions of the surface in comparison to uniform random sampling. In order to evaluate the ability of the reduced molecular surface to capture the information of the complete surface, we define a descriptor based on the local characterization of the molecular surface patch shape. Indeed, to evaluate the resulting representation of the surface, we compare the shape similarity between a portion of the surface obtained from the complete surface of the protein and the same portion of the surface obtained via our algorithm.
In order to study the gain of the proposed algorithm in terms of information preserved in the reduced version of the molecular surface, we compare the description of the complete surface also with random sampling, which represents the approach of trivial reduction of each molecular surface by decreasing, without criteria, the density of the number of points in space.
While part of the information concerning the interaction is encoded in various chemical features, it has already been discussed how the shape of local surface regions has a key role in predicting a protein ability to bind its molecular partner [11,24]. This is at the core of the development of the 2D Zernike polynomial expansion [8] as a method to find the protein-protein binding regions. This method could in principle be adapted to other properties (such as electrostatics or hydrophobicity) besides curvature since they can be described with numerical values that can be assigned to each surface point [25,26], and the Zernike expansion can be applied to any function. Nonetheless, in this work the algorithm is used to maximize the information concerning only the overall shape of the molecule. More specifically, the method is based on the description of each portion of the molecular surface by expanding the wellexposed molecular surface patches in terms of 2D Zernike polynomials, in order to be able to measure the geometrical similarity or complementarity between interacting proteins, with a lower computational cost than its 3D version [27][28][29][30][31][32].
We show that our proposed sampling reduces the number of considered points, minimizing the loss of information about the protein surface shape. To propose a generically valid set of parameters, we estimate the optimal sets for 70 proteins and compute their average values, which are: β = 0, γ = 4, δ = 5. By comparing the percentage of points sampled with each surface's best set and with these average values, we show that these last ones mostly results in a number of points close to those selected with the specific optimal parameters. Finally, to test the general validity of our method in a biologically relevant case (i.e. the protein-protein interaction), we study a subset of protein complexes among the ones used in [8] and evaluate the performance of the Zernike-based method in recognising the interacting regions of protein structures. In particular, we analyze the ability of the proposed method to characterize binding sites of these protein complexes, the structure of which has been experimentally resolved. For this purpose, we select the interacting regions of four protein surfaces and sample them with our proposed algorithm.
By analyzing how the total and sampled points distribute in the two-dimensional principal components (PCs) of the space spanned by the vectors of the Zernike invariants associated to each surface patch, we show that the subsets of points sampled using the best parameters' set fully cover the same portion of the essential space occupied by the whole bunch of surface points.

Protein structure and computation of the molecular surfaces
The protein used in the first part of this work is TDP-43, residues 209-269 (PDB id: 4bs2). The protein was equilibrated using GROMACS 2019.3 [33], and is going to be called A1 in the following discussion. For the optimization and testing of the parameters' selection we used 70 proteins complexes experimentally solved in complex in X-ray crystallography, taken from [8].
The solvent-accessible surfaces are calculated with DMS [34], with a density of 5 points per Å 2 and a water probe radius of 1.4 Å. For each surface point, we calculated the unit normal vector with the flag −n.

Zernike 2D procedure
The molecular surface we obtain with DMS is represented by a set of points in the threedimensional space. The first step of the 2D Zernike algorithm is to select from the surface a patch S, defined as the set of surface points included in a spherical region having radius R zernike = 6 Å and centered on one point of the surface. The points not directly connected to that region of the surface (for example coming from a protuberance included in the sphere) are removed. Once the patch has been selected, the mean vector of the normal vectors of the patch points is computed and oriented along the z-axis. Thus, given a point C on the z-axis, the angle θ is defined as the largest angle between the z-axis and a secant connecting C to any point of the surface S. C is then set so that θ = 45˚and each surface point is labeled with its distance r from C. As a next step, a square grid that associates each pixel with the mean r value calculated on the points inside it is built. The gap of pixels where no point of the surface has been projected are filled by using the average value of the surrounding pixels.
Such a 2D function can now be expanded on the basis of the Zernike polynomials.
Indeed, each function of two variables f(r, ψ) defined in polar coordinates inside the region of the unitary circle (r < 1) can be decomposed in the Zernike basis as f ðr; cÞ ¼ Since for each couple of polynomials, it is true that the complete set of polynomials forms a basis, and knowing the set of complex coefficients c n 0 m allows for a univocal reconstruction of the original patch. The norm of each coefficient z n 0 m = |c n 0 m | constitutes one of the Zernike invariant descriptors.

Similarity evaluation
Once a patch is represented in terms of its Zernike descriptors, the shape relation between that patch and another one can be simply measured as the Euclidean distance between the invariant vectors. The relative orientation of the patches before the projection in the unitary circle must be considered. In fact, if we search for similar regions we must compare patches that have the same orientation once projected in the 2D plane, i.e. the solvent-exposed part of the surface must be oriented in the same verse for both patches, for example along the positive zaxis. On the other hand, if we are looking for complementarity (as for searching the binding regions in a complex) the patches have to be oriented in opposite verses.

Identification of the binding regions and principal component analysis
For each of the 70 protein complexes sampled from [8] and reported in Table 2, the binding regions between the couples of proteins were defined as the sets of surface points whose distance with respect to the partner was lower than 3 Å. Finally, for each set, we compute its geometrical center and then consider as binding region the patch with radius 6 Å centered on it.
From each point of the identified binding regions, we evaluated the set of Zernike invariants and performed a principal component analysis (PCA) in which the starting matrix consisted of 121 columns (the number of invariants) and a number of rows equal to the number of points of the patch.

Results
The proposed method consists of a sampling suitably designed for selecting with greater probability surface points belonging to regions of high roughness. Indeed, when choosing where to select the patches, a compromise between a too sparse sampling (which would loose some regions of the surface) and a too dense one (which would add no information and would instead increase the computational time) must be found.
To this end, we define a function used for sampling the original surface: we use a method based on the 2D formalism of the Zernike polynomials to evaluate the ability of the algorithm to correctly approximate local regions of molecular surfaces with respect to a uniform random sampling. In the following Sections, we describe in detail the designed algorithm, the method we used as test based on the 2D Zernike formalism, and the results obtained by varying the parameters.
For an in-depth study of the proposed algorithm we consider different cases of application to a single surface (namely a fragment of TDP-43 that we are going to call A1). As a next step, to make this method applicable to a variety of surfaces, we apply it on a set of 70 protein surfaces. For each surface we test different parameters' combinations, selecting the best ones. This allows us to propose a set of mean parameters to be used for any given protein.

New roughness-dependent sampling
To begin with, we numerically represent the molecular surface with a set of N points in the 3D space (the discretization of the surface). For each point i, we evaluate the exiting normal vector, � v i , to the surface, originating from i. Next, we evaluate the local roughness of the molecular surface by looking at the relative orientation of the normal vectors with respect to each point i.
To do so, starting from each point i, we define a patch including all the surface points within a sphere of radius R patch centered on the point i. We calculate the roughness of each patch as the mean of the cosines of the angles formed by the normal vectors associated to each of the n p points of the patch and the average normal vector: Fig 1A shows the molecular surface for a case-of-study protein, TDP-43 (PDB id: 4BS2) residues 209-296, colored according to the local roughness. Being a mean of cosines, the roughness ranges from zero to one (see Fig 1B). When the considered patch is plane, the mean value of the cosine between the normal vectors of each point i of the surface and the mean normal vector of that patch, R i , is close to one, while lower values of R i indicate rougher patches.
Then, we associate to each point j in the patch centered on a point i, the probability to be accepted for the sampling, defined as: where r i,j is the distance of the point j from the center i, and α, β, γ and δ are parameters that can be optimized to yield different sampling scenarios. For instance, when β = 0, γ = 0, and δ = 0, we obtain a uniform random extraction, whose total number of points depends on α: this parameter determines the fraction (over all the surface's points) of considered points.
In general, when a patch i has a high roughness, more points are needed to describe it. On the other hand when it is more plane we need fewer points, and indeed ð1 À R i Þ becomes smaller. Finally, the center of a patch is always selected, but then to capture the surface's irregularities we can use as centers for the Zernike patches the points further away from it, i.e. the ones with a high value of r i,j . By elevating this term to the ð1 þ R i Þ we are changing the distribution of sampled points in each patch as a function of that patch roughness.

Selection of the sampling parameters
Here, we describe the method used to determine how effectively, starting from a sampling (determined by the combination of the parameters α, β, γ, and δ), we can describe the original surface.
1. For each combination of these four parameters, we sample from the original total surface a number n S of points and we define a new surface determined by these n S points. Then, we extract from the original total surface again n S points, but this time with a uniform distribution (or random extraction).
2. We select from the total surface n test points, and define around each of them a region with radius R = 6 Å.

PLOS ONE
3. Next, we associate to each one of these points, j, three vectors: z tot (j), z S (j) and z R (j). z tot (j) contains the Zernike descriptors that describe that patch as defined by all the total points included in it, z S (j) describes the patch as defined by the sampled points included in it and z R (j) describes the patch as defined by the included randomly extracted points.
4. For each of the n test patches we compute the distances Z t−S (j) = z tot (j) − z S (j) and Z t−R (j) = z tot (j) − z R (J). We average all the obtained Z t−S (j) and Z t−R (j), and obtain respectively the values Z t−S and Z t−R . Since we are considering the description given by all the original points as our "ideal", for a good sampling we expect the value of Z t−S to be small, and in particular smaller than Z t−R .
5. Finally, we compute the difference d = Z t−R − Z t−S . The best sampling for a surface should result in the maximization of d.
Our main objective is to describe the original surface as precisely as possible with a minimum number of points.

Selection starting from a pre-determined n S
By varying the four parameters we can observe a wide range of n S . Fig 2 shows an example of how this number changes for changing combinations of parameters. Table 1 shows, for each arbitrarily determined range of n S , the combination of parameters that results in the sampling with the highest d.

Selection in the limit cases
Similarly, we can fix a priori the values of some of the parameters to study some particular forms of the sampling distribution p(j).
In the following we analyze three peculiar cases: 1. No dependency on the roughness: when β = 0 and γ = 0, p(j) does not depend on R i , but only on the distance between the center of the considered patch and the considered point.
2. No dependency on the distance from the patch center: when γ = 0 and δ = 0, p(j) depends only on R i .
3. Steepest sampling function: when β = 1 and γ = 0 p(j) has the fastest variation as a function of δ. In this situation the best sampling (given these fixed parameters) should be clearly distinguishable. Fig 3 shows the analyses accomplished in each of the three limit cases. The first useful information is how the number of selected points changes according to the parameters value, because this is a first indication of the performance of that sampling function. Then we can optimize the non-fixed parameters, by looking for the maximization of d. Once the best parameters combination has been detected for each case, the differences between the three situations can be better appreciated by looking at what happens for different roughness of the patches. Fig 3 shows for each case the box plot of Z R tÀ S and Z R tÀ R , divided according to the roughness of the patches they describe. Z R tÀ S and Z R tÀ R are defined by averaging Z t−S (j) and Z t −R (j) only on the points whose R j falls in a specific interval. The considered patches are extracted (according to their roughness) from the points obtained with the best sampling, in each limit case, of the considered surface.

Selection without restriction
When there is no restriction on the number of sampled points or on the parameters' values, we can fix α = 1, since it is a multiplicative parameter that causes no variation of the distribution of sampled points between patches with different roughness values. When α < 1 we are removing from each patch the same percentage of points, so if we have no requirements about the final n S , this uniform removal of points can be avoided.
Consequently, we are interested in finding the combination of β, γ, and δ that results in the highest d. While it is true that a good sampling should result in a high d combined with a low n S , the weights that these two components should have in an optimization function will change according to the application and cannot be generalized.   ]. The last column shows, for the best parameters combination in each limit case, the box plot for different mean R i of Z R tÀ S (in red) and Z R tÀ R (in blue). These plots refer again to the application of this sampling method to the surface of A1. https://doi.org/10.1371/journal.pone.0266004.g003

PLOS ONE
Fig 4 shows how the best parameters combination is selected: for each β we individuate the best values for γ and δ, then to determine β we select among these pairings the one corresponding to the absolute highest d. Once the best parameters combination has been detected, we can better understand what exactly makes the sampling better than a random extraction from the total surface of the same number of points. Fig 5 shows for example the box plot of Z R tÀ S and Z R tÀ R divided according to the roughness of the patches they describe. As expected, we can see that plane patches (whose R i tends to one) are reconstructed nearly equally with the sampling and the random extraction, whereas for increasing roughness (R i approaching zero) the sampling results in Zernike vectors much closer to the ones obtained from the total surface, compared to the random case. That said, it must be remembered that with the sampling we are concentrating the n S points on the roughest zones. This means that with the sampling the plane patches are described with fewer points than in the random case: the improvement for the rough patches comes from a better surface reconstruction, while for plane patches it derives from the fact that a smaller number of points is used to reach the same reconstruction.
Finally, Fig 6 shows an example of how the surface is described when all its points are considered versus when only some subsets -including increasing number of points-are selected, with the sampling or randomly. When a small subset of points is used to reconstruct the surface, the difference between the sampling or a random extraction of the same number of points is clearly distinguishable. The more points are considered, the more the two selections become similar.

Optimization of the sampling parameters selection
To make this sampling method of general utility, we identify the best combination of parameters for different cases. Our aim is to provide the user with a list of possible sets of parameters averaged over a large set of different proteins. In particular, we selected 70 protein-protein The colouring is given by the β value (as described by the color-bar). The plotted surface is obtained from the interpolation of these points, and shows, for all the δ-γ combinations the value of β that will result in the best sampling. The maximum of this surface (red points) corresponds to the best sampling parameters. https://doi.org/10.1371/journal.pone.0266004.g004

PLOS ONE
complexes from [8] and for each of them only a single protein structure is considered for this analysis.
Here, we report for each parameter the mean and standard error over the mean: β = 0.026 ±0.019, γ = 0.026±0.019 and δ = 0.026±0.019. Interestingly, the β parameter appears to be compatible with zero. Finally, we compare the sampling obtained with the optimal parameters and the one with the average parameters. The results are shown in Table 2.
Looking at the difference between the percentage of sampled points in the optimal and averaged case (fifth column in Table 2) it seems that the latter is in many case a good approximation of the former. We note that since the sampling best parameters depend on the geometry of the surface, the optimal solution could be to associate the optimal set of parameters to each class of proteins, even if we expect it to be a not trivial relationship worth of future investigations. Comparison between the best sampling and the random selection of points for patches with different roughness. A) On the left, an example of how a rough patch is represented with all the points in the surface (whole patch), with only the points resulting from the sampling (sampled points) and with randomly extracted points (random points). On the right, the same three representation cases for a plane patch. B) In red, the box plot for different ranges of patches' roughness of Z R tÀ S . In blue, the box plot for different ranges of patches' roughness of Z R tÀ R . This is in the case of a sampling with parameters α = 1, β = 0, γ = 6 and δ = 0, whose combination results in the highest value of d for the A1 surface. https://doi.org/10.1371/journal.pone.0266004.g005

Focus on protein-protein binding regions
The proposed sampling method is in principle applicable to any molecular surface, for example for the study of protein-protein interactions. In this case, to be of practical utility, the procedure should be able to characterize the binding regions of protein complexes efficiently and with a low computational cost, for evaluating the binding properties. To test whether our proposed algorithm gives a set of points which optimizes the information of the local geometry of the interacting regions, we look at the preservation of information in terms of Zernike vectors. To do so, we select the interacting regions of four protein surfaces, which are randomly selected from the dataset reported in [8], and define them as a set of vectors. Each vector corresponds to a point i of the interacting molecular surface and is composed of the Zernike coefficients obtained by centering the patch on that point. Next, we select only a subset of the points belonging to the interacting surfaces, in agreement with the proposed sampling algorithm, Fig 6. Visualization of the 3D surfaces reconstruction. A) 3D reconstruction of the A1 surface from all its surface points. B) The three columns depict the reconstruction of the same surface, with an increasing sampling density. In each column, the first row shows the reconstruction with a subset of the original points selected with the sampling, whereas the second row shows the reconstruction with a subset that counts the same number of points selected with the sampling, but in this case randomly extracted. https://doi.org/10.1371/journal.pone.0266004.g006

PLOS ONE
according to different parameters sets (having fixed α = 1 since in this case we have no interest in a uniform restriction of the number of selected points). We then perform a Principal Components Analysis (PCA) of the Zernike vectors associated to each binding site, in order to compact the information into the essential space composed of the first two principal components (whose explained variances are 0.54 0.57, 0.51 and 0.64). Fig 7 shows the surface points belonging to the binding site (in blue) and the subset selected by the algorithm with the optimal For each protein, the "β", "γ" and "δ" columns report the best combination of parameters. The "Sampled points" column shows the percentage of selected points on the surface, while the last column "Missed points" displays the percentage of points that are lost (if the value is positive) or added (if it is negative) when the sampling is performed with the mean parameters β = 0, γ = 4, δ = 5. https://doi.org/10.1371/journal.pone.0266004.t002

PLOS ONE
parameters (in red), for each one of the four surfaces. The fourth column shows the same points coloured according to their R i ; as expected, the sampling is more dense where the surface is more rough. Next, we compute the percentage of space occupied in the PCs plot by the selected points compared to that spanned by the total points. The results are shown in the third column of Fig 7. The optimal parameters samplings (marked in red) correspond to the situation from which a higher number of sampled points starts conveying less information in the representation of the binding regions in terms of the Zernike vectors space.

Conclusion
In this paper, we present a new sampling method to reduce the number of points needed to describe a molecular surface. Given the importance of the definition of the molecular surface for the study of the interactions between molecules, we design a new algorithm to efficiently select the points of the original surface that minimize the loss of information in terms of describing the local shape of surface regions, which can potentially interact with a molecular partner. To test its performance, we use a recently developed method based on 2D Zernike polynomials, capable of describing the shape properties of portions of molecular surfaces. By means of Zernike, we verify if the patches centered around the sampled points are indeed the most representative of the surface. The sampling of these points is determined by Eq 7 and, when the best parameters are selected, results in a satisfactory reconstruction of the surface that needs only a subset of the total surface's points.
To determine how accurate the method is in selecting a subset of surface points, we compare the results of the sampling with what is obtained with a random selection of the same number of points. This is done in terms of the mean distance between the Zernike vectors that describe a patch containing all the surface points and the ones that describe the same region, but containing this time only the points selected with the sampling or randomly.
We presented our novel sampling strategy on a case study of a small protein and then extended it on a set of 70 proteins, comparing the results coming from the optimal sampling with those obtained with the average value of the parameters.
Finally, to verify the applicability of our procedure, we showed the sampling ability to characterize binding sites of protein complexes in a set of cases. Thanks to the reduction of the points that have to be considered, the computational cost of the following analysis on the surfaces is reduced. Therefore, the method can be applied to (i) macromolecules composed of a high number of residues, (ii) the analysis of the molecular surface of frames obtained from molecular dynamics simulations, and (iii) molecular surfaces calculated at high resolution for which a high number of points is required for a more detailed description. Moreover, in principle the method can be adapted to other properties besides curvature, such as electrostatics or hydrophobicity, since these are described with numerical values, that can be assigned to each surface point. This would mean sampling the points maximizing the conservation of different properties (in addition to morphological ones) and we are working on this, since the Zernike formalism can be applied to describe any function.